CPSC 545/445 (Autumn 2003) - Class 18: Gene Expression Analysis (2) Module 6, Part 2 ---- brief recap: key idea behind microarry technology -- (6.1 continued) Serial Analysis of Gene Expression (SAGE) - Velculescu et al., 1995 Key ideas: * A short sequence tag (10-14 bases) contains sufficient information to uniquely identify a transcript (mRna, cDNA, etc.) * Sequence tags can be linked together to form concatemers = long molecules that can be cloned and sequenced * Quantitation of the number of times a particular tag is observed provides the expression level of the corresponding transcript [from http://www.sagenet.org/] Basic steps of SAGE: 1. Trap RNAs with beads (using magnetic beads with attached poly-T oligos) 2. Convert the RNA into cDNA (using reverse transcriptase) 3. Make a cut in each cDNA so that there is a broken end sticking out (using restriction enzyme, recognition sequence is typically 4bp long) 4. Attach a "docking module" to this end; here a new enzyme can dock, reach down the molecule, and cut off a short tag (using another restriction enzyme that cuts away from its recognition site, which is located in the docking module) 5. Combine two tags into a unit, a di-tag (using ligase) 6. Make billions of copies of the di-tags (using PCR) 7. Remove the modules and glue the di-tags together into long concatamers (using restriction enzymes and ligase to join blunt ends) 8. Put the concatamers into bacteria and copy them millions of times (using standard cloning vector) 9. Pick the best concatamers (length = 25-75 tags) and sequence them using high-throughput sequencer 10. Use software to identify how many different cDNAs there are, and count them; 11. Match the sequence of each tag to the gene that produced the RNA. (alignment against existing genomic data) [based on http://www.embl-heidelberg.de/info/sage/, for a more detailed description, see http://www.genetics.pitt.edu/sage/#over] Note: Step 5 (ditag formation) is important in order to recognise and prevent asymmetric PCR amplification of tags in Step 6. Ditags help to avoid this problem - since the probability for forming each ditag is very small, presence of repeated ditags indicates invalid PCR result. sources of errors: * Tags may not always uniquely identify transcripts * One transcript may have more than one tag (if multiple poly-A sites are present) * Some transcripts may not contain restriction site for anchoring enzyme -> won't be detected by SAGE * restriction enzymes are not 100% precise -> varying length of tags, misinterpretation of ditags (e.g., 14+14=28=12+16) major advantage of sage compared to microarrays: - can identify unknown targets (gene finding) - can be quantitatively very precise -- briefly mention: oligonucleotide fingerprinting (ONF) - Lennon & Lehrach, 1991 - see Shamir & Sharan, 2002 originally motivated by "sequencing by hybridisation" spot cDNA on membrane (up to 31,000) sequentially hybridise with p oligos (7-12 bases long) -> fingerprint vector with p 0's/1's for each cDNA spot (barcode analogy) estimate expression level from #spots with similar / identical fingerprints (identify cDNAs for same gene by clustering) briefly mention: new techniques for detecting proteins directly (protein arrays; based on various capturing techniques, such as antigen-antibody, specific protein/protein, adaptamer/protein, receptor/ligand, enzyme/sustrate interactions) (not covered here) ---- 6.2 clustering gene expression data goal: identify groups of genes with sim expression patterns -> clustering of gene expression data: given: elements = genes characteristic vector for each gene = expression level under various conditions (each element of this vector stems from one expression level measurement) measure of similarity between pairs of vectors (e.g., corr coeff of vectors) (can also use measure of distance) want: partition elements (genes) into subsets (clusters) such that: - elements in same cluster are highly similar (homogeneity) - elements from differ clusters have low similarity (separation) -> optimisation problem [hom = avg sim between elements and their clusters -> min] [sep = weigthed avg sim betw clusters -> max] note: - can use various measures of similarity; homogeneity & separation - objectives (hom, sep) often conflict clustering algorithms: (many developed for broad range of applications) - in many cases, "clean" clustering alg with proven properties were developed first; then more efficient heuristic alg derived from these. -- hierarchical clustering: - bottom up (agglomerative clustering, see UPGMA) or top down (repeated partitioning of sets of elements) agglomerative hierarchical clustering: - input dissimilarity matrix D=(d_ij) - start with partitioning into singleton sets - in each step, join two clusters with minimal dissimilarity (max sim) - update dissimilarity matrix - output: dendrogram (tree of hierch cluster struct) ordered fingerprint matrix [slide: algorithm outline for agglom hier clustering] note: several variants, differ in def of dissim between clusters (upd matrix) (diff parameters alpha_*, gamma in outline)s -> shamir & sharan, 2002, Section 11.4.1 -> Cluster / TreeView software (Eisen et al., 1998) (similarity measure: form of correlation coefficient) -- k-means (first proposed in 1965!): - assume number of cluster is known - given: fingerprint vectors - goal: minimise distance between elements and centroids of assigned cluster for cluster j, c(j) = centroid of cluster j = mean of fingerprint vectors in j for element i, P(i) = cluster assigned to i by partition P d(v1,v2) = Euclidian distance between (fingerprint) vectors v1,v2 K-means tries to find partition P such that E_p = sum_i=1..n d(i,c(P(i)) is minimised idea: local search over partitions: each iteration modifies current P by moving one element between clusters to reduce E_p [slide: alg outline] (can also move in one iter all elem that would benefit from move) -> easy to implement, used in many applications variant of Herwig et al. (1999) works for unspecified number of clusters -- CAST - Clustering Affinity Search Technique (Ben-Dor et al., 1999) - motivated by polynomial time alg for finding true clustering with high prob under prob model of data (Ben-Dor et al., 1999) input: similarity matrix S affinity of elem v to (putative) cluster C: a(v) = sum{S(i,v) | i \in C } idea: generate clusters one by one; start cluster with single element, repeat add element with max affinity a if a > t|C| remove element with min affinity a if a <= t|C| until no change additional iterative improvement local search at end of alg [slide: alg outline of CAST], t is a parameter - implemented in package BIOCLUST -- other methods: - HCS, CLICK: recent clustering methods based on graph theoretic approach - self-organising maps (Kohonen networks) "unsupervised learning alg" assumes known number of clusters iterative method implemented in GeneCluster software (Tamayo et al., 1999) - large margin classifiers also: support vector machines, boosting -- notes on clustering (from conclusions of Shamir and Sharan - clustering is (to some extent) an art - no universal, agreed-upon criteria for eval of solutions - no ultimate alg (problem too general; but true even for gene expression) - further improvement of alg needs more data (real + synthetic) --- Resources: * Roded Sharan, Ron Shamir: Algorithmic Approaches to Clustering Gene Expression Data In: T. Jiang, T. Smith, Y. Xu, M.Q. Zhang (eds.) Current Topics in Computational Biology, pages 269–300. MIT Press, Cambridge, Massachusetts, 2002. Further Reading: ---